Feature Study

  Since much of the performance and results of the techniques seen in class are directly affected by the choice and treatment of the input space (or sample space), in this part of our project we study the features, look for possible correlations among them and, above all, remove the features that carry little significance.


In [1]:
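load_dataset <- function(filename) {
   # Load the CSV file `filename` and return its contents as a data frame.
   # stringsAsFactors = TRUE keeps text columns as factors on R >= 4.0,
   # matching the factor columns reported by str().
   read.csv(filename, stringsAsFactors = TRUE)
}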

In [2]:
dataset <- load_dataset("../dataset/xAPI-Edu-Data-full.csv")
head(dataset)


gender NationalITy PlaceofBirth StageID    GradeID SectionID Topic Semester Relation raisedhands VisITedResources AnnouncementsView Discussion ParentAnsweringSurvey ParentschoolSatisfaction StudentAbsenceDays Class
M      KW          KuwaIT       lowerlevel G-04    A         IT    F        Father   15          16               2                 20         Yes                   Good                     Under-7            M
M      KW          KuwaIT       lowerlevel G-04    A         IT    F        Father   20          20               3                 25         Yes                   Good                     Under-7            M
M      KW          KuwaIT       lowerlevel G-04    A         IT    F        Father   10          7                0                 30         No                    Bad                      Above-7            L
M      KW          KuwaIT       lowerlevel G-04    A         IT    F        Father   30          25               5                 35         No                    Bad                      Above-7            L
M      KW          KuwaIT       lowerlevel G-04    A         IT    F        Father   40          50               12                50         No                    Bad                      Above-7            M
F      KW          KuwaIT       lowerlevel G-04    A         IT    F        Father   42          30               13                70         Yes                   Bad                      Above-7            M

In [3]:
str(dataset)


'data.frame':	480 obs. of  17 variables:
 $ gender                  : Factor w/ 2 levels "F","M": 2 2 2 2 2 1 2 2 1 1 ...
 $ NationalITy             : Factor w/ 14 levels "Egypt","Iran",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ PlaceofBirth            : Factor w/ 14 levels "Egypt","Iran",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ StageID                 : Factor w/ 3 levels "HighSchool","lowerlevel",..: 2 2 2 2 2 2 3 3 3 3 ...
 $ GradeID                 : Factor w/ 10 levels "G-02","G-04",..: 2 2 2 2 2 2 5 5 5 5 ...
 $ SectionID               : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 2 ...
 $ Topic                   : Factor w/ 12 levels "Arabic","Biology",..: 8 8 8 8 8 8 9 9 9 8 ...
 $ Semester                : Factor w/ 2 levels "F","S": 1 1 1 1 1 1 1 1 1 1 ...
 $ Relation                : Factor w/ 2 levels "Father","Mum": 1 1 1 1 1 1 1 1 1 1 ...
 $ raisedhands             : int  15 20 10 30 40 42 35 50 12 70 ...
 $ VisITedResources        : int  16 20 7 25 50 30 12 10 21 80 ...
 $ AnnouncementsView       : int  2 3 0 5 12 13 0 15 16 25 ...
 $ Discussion              : int  20 25 30 35 50 70 17 22 50 70 ...
 $ ParentAnsweringSurvey   : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 2 1 2 2 2 ...
 $ ParentschoolSatisfaction: Factor w/ 2 levels "Bad","Good": 2 2 1 1 1 1 1 2 2 2 ...
 $ StudentAbsenceDays      : Factor w/ 2 levels "Above-7","Under-7": 2 2 1 1 1 1 1 2 2 2 ...
 $ Class                   : Factor w/ 3 levels "H","L","M": 3 3 2 2 3 3 2 3 3 3 ...
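
The `str` output above reports four integer columns and thirteen factors. Since numeric and categorical features are usually examined with different tools, it can help to split the column names by type before going further. The sketch below is only an illustration and not part of the original notebook: `split_by_type` is a hypothetical helper, and `toy` is a small data frame built from a few columns of the six rows printed by `head(dataset)`.

```r
# Hypothetical helper: separate numeric and categorical column names.
split_by_type <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  list(numeric = names(df)[is_num], categorical = names(df)[!is_num])
}

# Toy data: a subset of columns from the first six rows of the dataset.
toy <- data.frame(
  gender      = factor(c("M", "M", "M", "M", "M", "F")),
  raisedhands = c(15L, 20L, 10L, 30L, 40L, 42L),
  Discussion  = c(20L, 25L, 30L, 35L, 50L, 70L),
  Class       = factor(c("M", "M", "L", "L", "M", "M"))
)

cols <- split_by_type(toy)
print(cols$numeric)      # names of the numeric columns
print(cols$categorical)  # names of the factor columns
```

On the full data frame the same call would be `split_by_type(dataset)`.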

Attributes

  1. Gender - student's gender (nominal: 'Male' or 'Female'; stored as 'M' or 'F')
  2. Nationality - student's nationality (nominal: 'Kuwait', 'Lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan', 'Venezuela', 'Iran', 'Tunis', 'Morocco', 'Syria', 'Palestine', 'Iraq', 'Lybia')
  3. Place of birth - student's place of birth (nominal: 'Kuwait', 'Lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan', 'Venezuela', 'Iran', 'Tunis', 'Morocco', 'Syria', 'Palestine', 'Iraq', 'Lybia')
  4. Educational Stages - educational level the student belongs to (nominal: 'lowerlevel', 'MiddleSchool', 'HighSchool')
  5. Grade Levels - grade the student belongs to (nominal: 'G-01', 'G-02', 'G-03', 'G-04', 'G-05', 'G-06', 'G-07', 'G-08', 'G-09', 'G-10', 'G-11', 'G-12')
  6. Section ID - classroom the student belongs to (nominal: 'A', 'B', 'C')
  7. Topic - course topic (nominal: 'English', 'Spanish', 'French', 'Arabic', 'IT', 'Math', 'Chemistry', 'Biology', 'Science', 'History', 'Quran', 'Geology')
  8. Semester - school year semester (nominal: 'First', 'Second'; stored as 'F' or 'S')
  9. Relation - parent responsible for the student (nominal: 'Mum', 'Father')
  10. Raised hand - how many times the student raises his/her hand in the classroom (numeric: 0-100)
  11. Visited resources - how many times the student visits course content (numeric: 0-100)
  12. Viewing announcements - how many times the student checks new announcements (numeric: 0-100)
  13. Discussion groups - how many times the student participates in discussion groups (numeric: 0-100)
  14. Parent Answering Survey - whether the parent answered the surveys provided by the school (nominal: 'Yes', 'No')
  15. Parent School Satisfaction - the degree of parent satisfaction with the school (nominal: 'Good', 'Bad')
  16. Student Absence Days - the number of absence days for each student (nominal: 'Above-7', 'Under-7')
  17. Class - class label of the student, the last column of the dataset (nominal: 'H', 'L', 'M')
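
The documented values and the codes actually stored in the CSV can differ (for example, the `gender` column is stored as 'F'/'M'), so it may be worth listing the levels that are really present before analyzing each feature. The following is a minimal sketch under that assumption: `observed_levels` is a hypothetical helper, and `toy` reuses values from the six rows printed by `head(dataset)`.

```r
# Hypothetical helper: list the distinct levels of every factor column.
observed_levels <- function(df) {
  lapply(Filter(is.factor, df), levels)
}

# Toy data taken from the first six rows of the dataset.
toy <- data.frame(
  gender = factor(c("M", "M", "M", "M", "M", "F")),
  ParentschoolSatisfaction = factor(c("Good", "Good", "Bad", "Bad", "Bad", "Bad")),
  raisedhands = c(15L, 20L, 10L, 30L, 40L, 42L)
)

lv <- observed_levels(toy)
print(lv)  # numeric columns such as raisedhands are skipped
```

Calling `observed_levels(dataset)` would return the same kind of named list for all thirteen factor columns, which can then be compared against the attribute descriptions.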

Now analyzing each of the features


In [4]:
names(dataset)


  1. 'gender'
  2. 'NationalITy'
  3. 'PlaceofBirth'
  4. 'StageID'
  5. 'GradeID'
  6. 'SectionID'
  7. 'Topic'
  8. 'Semester'
  9. 'Relation'
  10. 'raisedhands'
  11. 'VisITedResources'
  12. 'AnnouncementsView'
  13. 'Discussion'
  14. 'ParentAnsweringSurvey'
  15. 'ParentschoolSatisfaction'
  16. 'StudentAbsenceDays'
  17. 'Class'

In [5]:
summary(dataset)


 gender     NationalITy       PlaceofBirth         StageID       GradeID   
 F:175   KW       :179   KuwaIT     :180   HighSchool  : 33   G-02   :147  
 M:305   Jordan   :172   Jordan     :176   lowerlevel  :199   G-08   :116  
         Palestine: 28   Iraq       : 22   MiddleSchool:248   G-07   :101  
         Iraq     : 22   lebanon    : 19                      G-04   : 48  
         lebanon  : 17   SaudiArabia: 16                      G-06   : 32  
         Tunis    : 12   USA        : 16                      G-11   : 13  
         (Other)  : 50   (Other)    : 51                      (Other): 23  
 SectionID     Topic     Semester   Relation    raisedhands    
 A:283     IT     : 95   F:245    Father:283   Min.   :  0.00  
 B:167     French : 65   S:235    Mum   :197   1st Qu.: 15.75  
 C: 30     Arabic : 59                         Median : 50.00  
           Science: 51                         Mean   : 46.77  
           English: 45                         3rd Qu.: 75.00  
           Biology: 30                         Max.   :100.00  
           (Other):135                                         
 VisITedResources AnnouncementsView   Discussion    ParentAnsweringSurvey
 Min.   : 0.0     Min.   : 0.00     Min.   : 1.00   No :210              
 1st Qu.:20.0     1st Qu.:14.00     1st Qu.:20.00   Yes:270              
 Median :65.0     Median :33.00     Median :39.00                        
 Mean   :54.8     Mean   :37.92     Mean   :43.28                        
 3rd Qu.:84.0     3rd Qu.:58.00     3rd Qu.:70.00                        
 Max.   :99.0     Max.   :98.00     Max.   :99.00                        
                                                                         
 ParentschoolSatisfaction StudentAbsenceDays Class  
 Bad :188                 Above-7:191        H:142  
 Good:292                 Under-7:289        L:127  
                                             M:211
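
One possible next step toward the correlation study mentioned at the start of this section is a Pearson correlation matrix over the four numeric features. The sketch below is an assumption about how that could be done, not the method used in the project: `numeric_correlations` is a hypothetical helper built on base R's `cor`, and `toy` contains only the six rows printed by `head(dataset)`, far too few to draw any conclusion.

```r
# Hypothetical helper: Pearson correlation matrix of the numeric columns.
numeric_correlations <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  cor(num, method = "pearson")
}

# Toy data: numeric features and class label from the first six rows.
toy <- data.frame(
  raisedhands       = c(15, 20, 10, 30, 40, 42),
  VisITedResources  = c(16, 20, 7, 25, 50, 30),
  AnnouncementsView = c(2, 3, 0, 5, 12, 13),
  Discussion        = c(20, 25, 30, 35, 50, 70),
  Class             = factor(c("M", "M", "L", "L", "M", "M"))
)

corr <- numeric_correlations(toy)
print(round(corr, 2))  # 4 x 4 symmetric matrix with ones on the diagonal
```

On the real data the call would be `numeric_correlations(dataset)`. Pairs of features with a correlation close to 1 or -1 carry largely redundant information and are natural candidates for closer inspection.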